126 ◾ Bioinformatics
window passes a threshold, then that window will be identified as an active region. A measure
such as entropy may be used to quantify the activity in the region.
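As a rough illustration of the windowed thresholding idea, the following Python sketch scores each position of a toy pileup by the Shannon entropy of its base calls and flags every window whose mean entropy exceeds a cutoff. The function names, the window size, the threshold, and the toy pileup are all assumptions made for this example; real variant callers use their own activity measures and parameters.

```python
import math
from collections import Counter


def column_entropy(bases):
    """Shannon entropy (in bits) of the base calls piled up at one position."""
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def active_windows(pileup, window=3, threshold=0.3):
    """Return (start, end) index pairs of windows whose mean entropy exceeds the threshold.

    The window size and threshold are hypothetical values chosen for the toy data.
    """
    scores = [column_entropy(column) for column in pileup]
    hits = []
    for start in range(len(scores) - window + 1):
        mean_score = sum(scores[start:start + window]) / window
        if mean_score > threshold:
            hits.append((start, start + window))
    return hits


# Toy pileup: each string holds the bases observed at one reference position.
# Only position 3 shows disagreement between reads.
pileup = ["AAAA", "CCCC", "GGGG", "AATT", "CCCC", "TTTT", "GGGG"]
windows = active_windows(pileup)
print(windows)
```

In this toy input only the windows that cover position 3 are flagged; overlapping flagged windows would normally be merged into a single active region.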
Following the identification of the active regions, the haplotypes are constructed by
reassembling the reads. A de Bruijn-like graph is used to reassemble each active region
and to identify the possible haplotypes present in the alignments. Once the haplotypes are
determined, the original alignment of the reads is discarded and the candidate haplotypes
are realigned to the reference haplotype using the Smith-Waterman local alignment
algorithm. Each read is then aligned pairwise against each candidate haplotype using a
Pairwise Hidden Markov Model (PairHMM), which generates a matrix of likelihoods of the
haplotypes given the read data. These likelihoods are then marginalized to obtain the
likelihoods of alleles for each potentially variant site given the read data. The genotype,
that is, the most likely pair of alleles, is then determined for each position. For a given
genotype ($G_i$) and a subset of overlapping reads ($R_i$), the variant callers use Bayesian
statistics to evaluate the posterior probability of the hypothesized genotype ($G_i$) as follows:
$$P(G_i \mid R_i) = \frac{P(R_i \mid G_i) \times P(G_i)}{P(R_i)} \tag{4.1}$$
where the posterior probability $P(G_i \mid R_i)$ is the probability of the genotype ($G_i$) given that
subset of reads ($R_i$), $P(G_i)$ is the prior probability that we expect to observe the genotype
based on previous observations, $P(R_i)$ is the marginal probability of observing that subset of
reads (the probability of the evidence), and $P(R_i \mid G_i)$ is the probability of the reads
given the genotype. The Bayesian variant caller writes the above formula as:
$$P(G_i \mid R_i) = \frac{P(R_i \mid G_i) \times P(G_i)}{\sum_k P(R_i \mid G_k) \times P(G_k)} \tag{4.2}$$
We can ignore the denominator because it is the same for all genotypes. Thus
$$P(G_i \mid R_i) \propto P(R_i \mid G_i) \times P(G_i) \tag{4.3}$$
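The following Python sketch illustrates why the denominator can be dropped when only the most likely genotype is needed. It computes both the unnormalized product from Equation 4.3 and the normalized posterior from Equation 4.2 and shows that they select the same genotype. The function name, the genotype labels, and the likelihood values are made up for illustration, and a uniform prior is assumed when none is supplied.

```python
def genotype_posteriors(likelihoods, priors=None):
    """Return (unnormalized, posteriors) for a dict mapping genotype -> P(R | G).

    If no priors are given, a uniform prior over the genotypes is assumed.
    """
    genotypes = list(likelihoods)
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    # Numerator of Equation 4.2, which is also the right-hand side of Equation 4.3.
    unnormalized = {g: likelihoods[g] * priors[g] for g in genotypes}
    # Denominator of Equation 4.2: the sum over all candidate genotypes.
    evidence = sum(unnormalized.values())
    posteriors = {g: value / evidence for g, value in unnormalized.items()}
    return unnormalized, posteriors


# Hypothetical likelihoods P(R | G) for three candidate genotypes at one site.
likelihoods = {"AA": 0.001, "AT": 0.02, "TT": 0.004}
unnormalized, posteriors = genotype_posteriors(likelihoods)

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_posterior, posteriors)
```

Normalization rescales every genotype by the same constant, so the ranking is unchanged; the normalized values are only needed when an actual probability, such as a genotype quality, is reported.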
The variant callers use a flat prior probability that can be changed by the users if the
probabilities of the genotypes are known based on previous observations. The important
probability in the above formula is $P(R_i \mid G_i)$, which can also be described in terms of the
likelihood of the hypothesis of the genotype ($G_i$) given the reads ($R_i$):
$$P(R_i \mid G_i) = L(G_i \mid R_i) = \prod_j \left( \frac{L(R_j \mid H_1)}{2} + \frac{L(R_j \mid H_2)}{2} \right) \tag{4.4}$$
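where $L(R_j \mid H_1)$ and $L(R_j \mid H_2)$ are the haplotype likelihoods, that is, the likelihoods of
read $R_j$ given each of the two haplotypes, $H_1$ and $H_2$, that make up the genotype.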
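A minimal Python sketch of Equation 4.4 is given below, assuming a diploid sample and a small table of per-read haplotype likelihoods of the kind a PairHMM step would produce. The table values, the function name, and the use of log space to avoid numerical underflow are assumptions made for this example and do not reproduce any particular caller's implementation.

```python
import math
from itertools import combinations_with_replacement


def genotype_log_likelihood(read_hap_lik, h1, h2):
    """log P(R | G) for the diploid genotype (h1, h2), following Equation 4.4.

    read_hap_lik[j][h] holds L(R_j | H_h); the product over reads is computed
    as a sum of logarithms.
    """
    return sum(math.log(row[h1] / 2 + row[h2] / 2) for row in read_hap_lik)


# Hypothetical likelihood table: rows are reads, columns are two haplotypes.
read_hap_lik = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
]
n_haplotypes = 2

# Score every unordered pair of haplotypes, including homozygous pairs.
scores = {
    (h1, h2): genotype_log_likelihood(read_hap_lik, h1, h2)
    for h1, h2 in combinations_with_replacement(range(n_haplotypes), 2)
}
best = max(scores, key=scores.get)
print(best, {g: math.exp(s) for g, s in scores.items()})
```

With two reads supporting each haplotype, the heterozygous pair (0, 1) scores highest, because every read is explained reasonably well by one of the two haplotypes.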
The likelihoods of all possible genotypes are calculated based on the alleles that were
observed at the site, considering every possible combination of alleles. Then, the most likely